In the rapidly evolving landscape of financial technology, credit scoring remains a cornerstone in determining the creditworthiness of applicants. This paper delves into the realm of predictive modeling, using a dataset provided by our professors, enriched with comprehensive credit scoring information. The overarching objective of our study is to train a model that efficiently and accurately determines the eligibility of individuals for credit approval.

The task set forth by our professors involves a comprehensive journey through four key exercises. The first exercise is centered around a thorough analysis of the dataset. Here, we delve into understanding the underlying structures, identifying critical variables, and most importantly, balancing the dataset to ensure a fair representation of diverse credit scenarios.

Subsequently, in the second exercise, we focus on training and testing a logistic classifier. This step is crucial in establishing a baseline model from which further improvements can be measured.

The third exercise involves enhancing the predictive performance of the model. This stage is particularly challenging and significant, as it entails fine-tuning the model to ensure it captures the nuances of credit scoring with greater accuracy and efficiency.

Finally, the fourth exercise invites us to step into the practical world. Here, we explore the various challenges a company may face if they were to implement our model in a real-world scenario. This exercise not only grounds our theoretical work in reality but also provides valuable insights into the practical implications and considerations in deploying machine learning models in the credit industry.

This paper aims to not only present a robust model for credit scoring but also to contribute to the broader understanding of how machine learning can be effectively utilized in financial decision-making processes.

Load packages and data

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL
## 
## [[10]]
## NULL
## 
## [[11]]
## NULL
## 
## [[12]]
## NULL
## 
## [[13]]
## NULL
## 
## [[14]]
## NULL
## 
## [[15]]
## NULL
## 
## [[16]]
## NULL
## 
## [[17]]
## NULL
## 
## [[18]]
## NULL
## 
## [[19]]
## NULL
## 
## [[20]]
## NULL
## 
## [[21]]
## NULL
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'reshape2'
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Warning: package 'plotly' was built under R version 4.3.1
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## Warning: package 'lattice' was built under R version 4.3.1
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## 
## Attaching package: 'DescTools'
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## Warning: package 'pROC' was built under R version 4.3.1
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## 
## Loaded ROSE 0.0-4
## 
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## 
## Attaching package: 'PerformanceAnalytics'
## 
## The following object is masked from 'package:graphics':
## 
##     legend
## 
## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:plotly':
## 
##     rename
## 
## The following objects are masked from 'package:reshape2':
## 
##     colsplit, melt, recast
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## Warning: package 'knitr' was built under R version 4.3.1
## 
## Attaching package: 'dlookr'
## 
## The following object is masked from 'package:corrr':
## 
##     correlate
## 
## The following objects are masked from 'package:PerformanceAnalytics':
## 
##     kurtosis, skewness
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## The following object is masked from 'package:base':
## 
##     transform
## 
## corrplot 0.92 loaded
## [[1]]
## [1] "readr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "readr"     "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "readr"     "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "tidyr"    
##  [7] "tibble"    "ggplot2"   "tidyverse" "readr"     "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[5]]
##  [1] "reshape2"  "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "readr"     "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[6]]
##  [1] "plotly"    "reshape2"  "lubridate" "forcats"   "stringr"   "dplyr"    
##  [7] "purrr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "readr"    
## [13] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [19] "base"     
## 
## [[7]]
##  [1] "caret"     "lattice"   "plotly"    "reshape2"  "lubridate" "forcats"  
##  [7] "stringr"   "dplyr"     "purrr"     "tidyr"     "tibble"    "ggplot2"  
## [13] "tidyverse" "readr"     "stats"     "graphics"  "grDevices" "utils"    
## [19] "datasets"  "methods"   "base"     
## 
## [[8]]
##  [1] "DescTools" "caret"     "lattice"   "plotly"    "reshape2"  "lubridate"
##  [7] "forcats"   "stringr"   "dplyr"     "purrr"     "tidyr"     "tibble"   
## [13] "ggplot2"   "tidyverse" "readr"     "stats"     "graphics"  "grDevices"
## [19] "utils"     "datasets"  "methods"   "base"     
## 
## [[9]]
##  [1] "pROC"      "DescTools" "caret"     "lattice"   "plotly"    "reshape2" 
##  [7] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "tidyr"    
## [13] "tibble"    "ggplot2"   "tidyverse" "readr"     "stats"     "graphics" 
## [19] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[10]]
##  [1] "ROCR"      "pROC"      "DescTools" "caret"     "lattice"   "plotly"   
##  [7] "reshape2"  "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
## [13] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "readr"     "stats"    
## [19] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[11]]
##  [1] "ROSE"      "ROCR"      "pROC"      "DescTools" "caret"     "lattice"  
##  [7] "plotly"    "reshape2"  "lubridate" "forcats"   "stringr"   "dplyr"    
## [13] "purrr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "readr"    
## [19] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [25] "base"     
## 
## [[12]]
##  [1] "Boruta"    "ROSE"      "ROCR"      "pROC"      "DescTools" "caret"    
##  [7] "lattice"   "plotly"    "reshape2"  "lubridate" "forcats"   "stringr"  
## [13] "dplyr"     "purrr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse"
## [19] "readr"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [25] "methods"   "base"     
## 
## [[13]]
##  [1] "RColorBrewer" "Boruta"       "ROSE"         "ROCR"         "pROC"        
##  [6] "DescTools"    "caret"        "lattice"      "plotly"       "reshape2"    
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "readr"       
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[14]]
##  [1] "ggcorrplot"   "RColorBrewer" "Boruta"       "ROSE"         "ROCR"        
##  [6] "pROC"         "DescTools"    "caret"        "lattice"      "plotly"      
## [11] "reshape2"     "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "readr"        "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[15]]
##  [1] "PerformanceAnalytics" "xts"                  "zoo"                 
##  [4] "ggcorrplot"           "RColorBrewer"         "Boruta"              
##  [7] "ROSE"                 "ROCR"                 "pROC"                
## [10] "DescTools"            "caret"                "lattice"             
## [13] "plotly"               "reshape2"             "lubridate"           
## [16] "forcats"              "stringr"              "dplyr"               
## [19] "purrr"                "tidyr"                "tibble"              
## [22] "ggplot2"              "tidyverse"            "readr"               
## [25] "stats"                "graphics"             "grDevices"           
## [28] "utils"                "datasets"             "methods"             
## [31] "base"                
## 
## [[16]]
##  [1] "corrr"                "PerformanceAnalytics" "xts"                 
##  [4] "zoo"                  "ggcorrplot"           "RColorBrewer"        
##  [7] "Boruta"               "ROSE"                 "ROCR"                
## [10] "pROC"                 "DescTools"            "caret"               
## [13] "lattice"              "plotly"               "reshape2"            
## [16] "lubridate"            "forcats"              "stringr"             
## [19] "dplyr"                "purrr"                "tidyr"               
## [22] "tibble"               "ggplot2"              "tidyverse"           
## [25] "readr"                "stats"                "graphics"            
## [28] "grDevices"            "utils"                "datasets"            
## [31] "methods"              "base"                
## 
## [[17]]
##  [1] "networkD3"            "corrr"                "PerformanceAnalytics"
##  [4] "xts"                  "zoo"                  "ggcorrplot"          
##  [7] "RColorBrewer"         "Boruta"               "ROSE"                
## [10] "ROCR"                 "pROC"                 "DescTools"           
## [13] "caret"                "lattice"              "plotly"              
## [16] "reshape2"             "lubridate"            "forcats"             
## [19] "stringr"              "dplyr"                "purrr"               
## [22] "tidyr"                "tibble"               "ggplot2"             
## [25] "tidyverse"            "readr"                "stats"               
## [28] "graphics"             "grDevices"            "utils"               
## [31] "datasets"             "methods"              "base"                
## 
## [[18]]
##  [1] "reshape"              "networkD3"            "corrr"               
##  [4] "PerformanceAnalytics" "xts"                  "zoo"                 
##  [7] "ggcorrplot"           "RColorBrewer"         "Boruta"              
## [10] "ROSE"                 "ROCR"                 "pROC"                
## [13] "DescTools"            "caret"                "lattice"             
## [16] "plotly"               "reshape2"             "lubridate"           
## [19] "forcats"              "stringr"              "dplyr"               
## [22] "purrr"                "tidyr"                "tibble"              
## [25] "ggplot2"              "tidyverse"            "readr"               
## [28] "stats"                "graphics"             "grDevices"           
## [31] "utils"                "datasets"             "methods"             
## [34] "base"                
## 
## [[19]]
##  [1] "knitr"                "reshape"              "networkD3"           
##  [4] "corrr"                "PerformanceAnalytics" "xts"                 
##  [7] "zoo"                  "ggcorrplot"           "RColorBrewer"        
## [10] "Boruta"               "ROSE"                 "ROCR"                
## [13] "pROC"                 "DescTools"            "caret"               
## [16] "lattice"              "plotly"               "reshape2"            
## [19] "lubridate"            "forcats"              "stringr"             
## [22] "dplyr"                "purrr"                "tidyr"               
## [25] "tibble"               "ggplot2"              "tidyverse"           
## [28] "readr"                "stats"                "graphics"            
## [31] "grDevices"            "utils"                "datasets"            
## [34] "methods"              "base"                
## 
## [[20]]
##  [1] "dlookr"               "knitr"                "reshape"             
##  [4] "networkD3"            "corrr"                "PerformanceAnalytics"
##  [7] "xts"                  "zoo"                  "ggcorrplot"          
## [10] "RColorBrewer"         "Boruta"               "ROSE"                
## [13] "ROCR"                 "pROC"                 "DescTools"           
## [16] "caret"                "lattice"              "plotly"              
## [19] "reshape2"             "lubridate"            "forcats"             
## [22] "stringr"              "dplyr"                "purrr"               
## [25] "tidyr"                "tibble"               "ggplot2"             
## [28] "tidyverse"            "readr"                "stats"               
## [31] "graphics"             "grDevices"            "utils"               
## [34] "datasets"             "methods"              "base"                
## 
## [[21]]
##  [1] "corrplot"             "dlookr"               "knitr"               
##  [4] "reshape"              "networkD3"            "corrr"               
##  [7] "PerformanceAnalytics" "xts"                  "zoo"                 
## [10] "ggcorrplot"           "RColorBrewer"         "Boruta"              
## [13] "ROSE"                 "ROCR"                 "pROC"                
## [16] "DescTools"            "caret"                "lattice"             
## [19] "plotly"               "reshape2"             "lubridate"           
## [22] "forcats"              "stringr"              "dplyr"               
## [25] "purrr"                "tidyr"                "tibble"              
## [28] "ggplot2"              "tidyverse"            "readr"               
## [31] "stats"                "graphics"             "grDevices"           
## [34] "utils"                "datasets"             "methods"             
## [37] "base"
## Rows: 40000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): grade, home_ownership, verification_status, purpose, application_type
## dbl (12): loan_amnt, int_rate, annual_inc, dti, open_acc, revol_bal, revol_u...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Excercise 1

1.1 Describe the data. Specifically:

1.1.1: Check and report the structure of the data set.

str(loan_sample)
## spc_tbl_ [40,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ loan_amnt          : num [1:40000] 10000 16000 6000 9000 24000 8000 3600 5550 2000 10000 ...
##  $ int_rate           : num [1:40000] 11.47 9.49 17.99 9.75 12.59 ...
##  $ grade              : chr [1:40000] "B" "B" "D" "B" ...
##  $ home_ownership     : chr [1:40000] "RENT" "MORTGAGE" "RENT" "MORTGAGE" ...
##  $ annual_inc         : num [1:40000] 35000 110000 40000 54000 66000 ...
##  $ verification_status: chr [1:40000] "Verified" "Not Verified" "Verified" "Verified" ...
##  $ purpose            : chr [1:40000] "debt_consolidation" "debt_consolidation" "home_improvement" "car" ...
##  $ dti                : num [1:40000] 22.05 15.67 17.34 3.58 12.25 ...
##  $ open_acc           : num [1:40000] 15 9 8 11 8 10 8 5 8 14 ...
##  $ revol_bal          : num [1:40000] 10211 10068 3755 1459 29656 ...
##  $ revol_util         : num [1:40000] 31.5 32.7 25.5 22.8 70.6 34.3 36.1 66.1 24.6 75.6 ...
##  $ total_acc          : num [1:40000] 17 21 19 34 21 17 9 12 30 21 ...
##  $ total_rec_int      : num [1:40000] 1574 2364 943 436 2778 ...
##  $ application_type   : chr [1:40000] "Individual" "Individual" "Individual" "Joint App" ...
##  $ tot_cur_bal        : num [1:40000] 17440 35104 19995 8713 29656 ...
##  $ total_rev_hi_lim   : num [1:40000] 32400 30800 14700 6400 42000 28200 21400 6200 17500 30300 ...
##  $ Status             : num [1:40000] 0 0 0 0 0 0 1 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   loan_amnt = col_double(),
##   ..   int_rate = col_double(),
##   ..   grade = col_character(),
##   ..   home_ownership = col_character(),
##   ..   annual_inc = col_double(),
##   ..   verification_status = col_character(),
##   ..   purpose = col_character(),
##   ..   dti = col_double(),
##   ..   open_acc = col_double(),
##   ..   revol_bal = col_double(),
##   ..   revol_util = col_double(),
##   ..   total_acc = col_double(),
##   ..   total_rec_int = col_double(),
##   ..   application_type = col_character(),
##   ..   tot_cur_bal = col_double(),
##   ..   total_rev_hi_lim = col_double(),
##   ..   Status = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Checking for NAs
any(is.na(loan_sample))
## [1] FALSE

The result of “FALSE” from the check indicates that there are no missing values (NAs) in the “loan_sample” dataset. So the dataset is provided with full values means that we can work with this dataset without cleaning NA’s.

1.1.2: How many numeric and how many categorical variables are included in the data? What categorical variable has the most levels in it?

# First we have to check whether all variables have been imported in the correct format. 
first_overview <- overview(loan_sample)
plot(first_overview)

It is clearly visible that the columns with the data type character still have to be converted to factors. In addition, the “Status” column is currently still a numeric data type. This must also be converted to factor, as it is a binary variable. To do further analysis this needs to be done to get further data with the same structure to use for our model and analysis.

loan_sample <- loan_sample %>%
  mutate_if(is.character, as.factor) %>% # Convert all character columns to factors
  mutate(Status = as.factor(Status)) # Convert 'Status' column to factor

# Plot to see, if the datatypes are correct
second_overview <- overview(loan_sample)
plot(second_overview)

That looks right, now we can count the number of variables. As we can see, there are no anomalies in this bar chart accordingly, we should be able to proceed with these converted data.

# Count numeric variables in 'loan_sample'
numeric_vars_count <- sum(sapply(loan_sample, is.numeric))

# Count categorical (factor) variables in 'loan_sample'
categorical_vars_count <- sum(sapply(loan_sample, is.factor))

# Print counts of numeric and categorical variables
cat("There are", numeric_vars_count, "numeric variables and", categorical_vars_count, "categorical variables in the dataset.")
## There are 11 numeric variables and 6 categorical variables in the dataset.
categorical_levels <- sapply(loan_sample, function(x) if(is.factor(x)) length(unique(x)) else NA)

# Find the name of the categorical variable with the most levels
cat_var_most_levels <- names(which.max(categorical_levels))

# Find the number of levels for this variable
levels_count <- max(categorical_levels, na.rm = TRUE)

# Print the variable name and the number of levels
cat("The categorical variable with the most levels in the dataset is:", cat_var_most_levels, "with", levels_count, "levels.")
## The categorical variable with the most levels in the dataset is: purpose with 13 levels.

1.1.3: Summarize the variables. Discuss the summary statistics obtained.

# Categorial Label with the most Levels
summary(loan_sample)
##    loan_amnt        int_rate     grade      home_ownership    annual_inc    
##  Min.   : 1000   Min.   : 5.31   A: 7329   MORTGAGE:17701   Min.   :  5000  
##  1st Qu.: 7000   1st Qu.: 9.44   B:13166   OWN     : 3981   1st Qu.: 42000  
##  Median :10075   Median :12.29   C:11842   RENT    :18318   Median : 57000  
##  Mean   :11687   Mean   :12.62   D: 7663                    Mean   : 63400  
##  3rd Qu.:15000   3rd Qu.:15.05                              3rd Qu.: 77000  
##  Max.   :40000   Max.   :27.49                              Max.   :400000  
##                                                                             
##       verification_status               purpose           dti       
##  Not Verified   :14373    debt_consolidation:23342   Min.   : 0.00  
##  Source Verified:16116    credit_card       : 9362   1st Qu.:12.13  
##  Verified       : 9511    other             : 2337   Median :17.60  
##                           home_improvement  : 2099   Mean   :18.22  
##                           major_purchase    :  794   3rd Qu.:23.86  
##                           medical           :  444   Max.   :60.14  
##                           (Other)           : 1622                  
##     open_acc       revol_bal       revol_util       total_acc    
##  Min.   : 1.00   Min.   :    0   Min.   :  0.00   Min.   : 3.00  
##  1st Qu.: 8.00   1st Qu.: 5615   1st Qu.: 34.80   1st Qu.:15.00  
##  Median :10.00   Median : 9818   Median : 52.40   Median :20.00  
##  Mean   :10.31   Mean   :11995   Mean   : 52.17   Mean   :21.27  
##  3rd Qu.:13.00   3rd Qu.:15832   3rd Qu.: 70.00   3rd Qu.:27.00  
##  Max.   :23.00   Max.   :78762   Max.   :121.40   Max.   :57.00  
##                                                                  
##  total_rec_int      application_type  tot_cur_bal     total_rev_hi_lim
##  Min.   :   0.0   Individual:39450   Min.   :     0   Min.   :   300  
##  1st Qu.: 673.2   Joint App :  550   1st Qu.: 25136   1st Qu.: 13000  
##  Median :1342.5                      Median : 53722   Median : 20900  
##  Mean   :1818.5                      Mean   : 99100   Mean   : 24194  
##  3rd Qu.:2432.9                      3rd Qu.:157834   3rd Qu.: 32200  
##  Max.   :8834.9                      Max.   :472573   Max.   :100000  
##                                                                       
##  Status   
##  0:34810  
##  1: 5190  
##           
##           
##           
##           
## 

Our dataset contains information from 40,000 loans, giving us a detailed look into how people handle their finances. On average, folks are requesting loans around $11,687, but it’s interesting to note that the range is pretty wide, going from as little as $1,000 to as much as $40,000. This tells us that people have different needs when it comes to borrowing money.

Now, when it comes to interest rates, the average is about 12.62%, but it’s quite a spread. Some lucky ones are scoring rates as low as 5.31%, while others are dealing with rates as high as 27.49%. It’s a reminder that what you end up with can vary quite a bit.

On the income side of things, the average yearly income for loan applicants is $63,400. But, as expected, there’s a range here too. Some folks are pulling in more, and some are pulling in less. This is just a snapshot of the diversity in people’s financial situations.

We also took a look at the debt-to-income ratio (DTI), which gives us an idea of how much of a person’s income is tied up in paying off debt. On average, it’s sitting at 18.22%. It’s a useful measure for understanding how comfortably people can manage their debt.

Switching gears to credit lines, on average, people have around 10 open credit lines. But, if we look back over time, some have had as many as 21. This gives us insight into how people handle credit – how many cards or loans they’ve had in the past.

The average amount owed on these credit lines is $11,995, and people tend to use about half of their available credit. This shows that folks are using their credit but not maxing it out completely.

When it comes to interest payments, the average is around $1,818. That’s extra money on top of the initial loan amount. Looking at the big picture, the total average outstanding debt across all loans is $99,100. This tells us that, collectively, there’s a significant amount of money being borrowed.

Finally, the average credit limit on people’s credit lines is $24,194. This is the maximum amount the bank allows them to borrow. It’s a cap on their potential debt.

So, diving into these numbers not only gives us a sense of individual financial situations but also paints a broader picture of how people navigate their finances through loans.

1.1.4: Check the levels of the target variable by choosing the appropriate visualization. Is the target variable balanced?

# Plot the distribution of the target variable
ggplot(loan_sample, aes(x = Status, fill = Status)) + 
  geom_bar() +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = 'Distribution of target variable (Status)', x = 'Status', y = 'Count') +
  theme_minimal()

The ‘Status’ variable in our dataset serves as a crucial indicator of credit approval outcomes. It distinctly classifies these outcomes into two levels: ‘0’ represents a ‘No,’ indicating that the credit has not been approved, while ‘1’ signifies a ‘Yes,’ confirming the approval of the credit.

Upon closer inspection, our dataset reveals a notable imbalance in the distribution of these outcomes. Specifically, there is a significantly higher number of instances where credits were not approved (Status ‘0’) compared to those that were approved (Status ‘1’). This imbalance is a critical aspect to consider, especially in the context of machine learning model development.

In machine learning, an unbalanced dataset can introduce bias into the predictive model. In our case, the imbalance may lead the model to overemphasize the majority class (non-approved credits), potentially resulting in a classifier that tends to predict a ‘No’ more frequently.

To mitigate this issue, it becomes essential to address the class imbalance before training our predictive model. Balancing the dataset involves techniques such as oversampling the minority class (approved credits), undersampling the majority class (non-approved credits), or using more advanced methods like synthetic data generation. By doing so, we aim to create a more equitable representation of both outcomes in our training data.

1.1.5: Check the distribution of the numeric variables in the data set (include different visual representations).

# Histograms
loan_sample %>% 
  select_if(is.numeric) %>% 
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~variable, scales = 'free_x') +
  theme_minimal()

The histogram we created for the variable open_acc has gaps. We should examine this more closely.

# Create boxplot of the variable total_acc
ggplot(loan_sample, aes(x = open_acc)) +
  geom_histogram(bins = 22) + # Adjust the number of bins as necessary
  labs(x = "Open Accounts", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Since the variable ranges from 1 to 22, using 30 bins leads to empty spaces. Adjusting the bins to match the actual value range has provided a clearer, gap-free histogram.

# Boxplots
loan_sample %>% 
  select_if(is.numeric) %>% 
  gather(key = "variable", value = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = 'free') +
  theme_minimal()

1.2: Investigating outliers with a box plot

Elaborate your view on how to proceed in dealing with the outliers and – if necessary – take appropriate action.

# Reshape the data to long format and scale the numeric values
loan_sample_long_scaled <- loan_sample %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), scale)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

# Create a boxplot with the scaled values
ggplot(loan_sample_long_scaled, aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(x = "", y = "Scaled Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

The box plot effectively captures the distribution of scaled numeric variables, and the presence of outliers is evident as individual points outside the box plot. Notably, several variables exhibit outliers, indicating the potential presence of extreme values in our dataset.

To gain a deeper understanding, it is essential to contextualize these outliers. Are they genuine extreme values reflective of the population, or do they stem from errors in data collection or measurement discrepancies? This distinction is crucial in determining the appropriate course of action.

Understanding the impact of outliers on our analysis is paramount. Outliers can significantly influence statistical measures, potentially skewing measures of central tendency and affecting the overall spread of the data. For instance, they may impact the accuracy of mean values and standard deviations.

A closer look at the box plot reveals specific variables that seem to have more pronounced outliers. Exploring these outliers in the context of the variables they pertain to could reveal valuable insights and guide subsequent steps in the analysis.

Various statistical techniques, such as calculating z-scores or leveraging the interquartile range (IQR), can systematically identify outliers. Implementing these techniques allows for a more rigorous and objective assessment of outlier presence.

When deciding how to address outliers, removal or correction is a common approach. However, alternative strategies exist, including transforming the data, winsorizing (capping extreme values), or stratifying analyses based on the presence of outliers. Each strategy comes with its considerations and potential implications for the overall analysis.

In summary, recognizing and addressing outliers is a nuanced process that requires careful consideration of their nature, impact, and appropriate handling strategies to ensure the integrity of our analysis.

# Diagnosing outliers and selecting variables with more than 1% outliers
variables_with_high_outlier_ratio <- loan_sample %>%
  diagnose_outlier() %>%
  filter(outliers_ratio > 1) %>%
  pull(variables)

# Plotting the outliers for these variables
loan_sample %>%
  plot_outlier(one_of(variables_with_high_outlier_ratio))

As we worked on refining our model, we initially thought removing more outliers would make it better, so we decided to keep 95% of our data and discard the top 5% as outliers. Surprisingly, this move didn’t improve the model; in fact, it made it less precise.

Realizing that less might be more in this case, we changed our approach. Instead of removing 5%, we opted for a more cautious strategy, cutting out only 1% of outliers. This way, we’re holding onto 99% of our data, hoping to strike a balance between cleaning up noisy data and keeping enough information for the model to learn effectively.

This back-and-forth shows that finding the right balance in handling outliers is a bit of a puzzle, and we’re trying to crack it to make our model as accurate and reliable as possible.

# Function for adjusting the values to the quantile limits
cap_values_at_quantiles <- function(x) {
    quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
    x <- ifelse(x < quantiles[1], quantiles[1], x)
    x <- ifelse(x > quantiles[2], quantiles[2], x)
    x
}

loan_sample_truncated <- loan_sample %>%
  mutate(across(where(is.numeric), cap_values_at_quantiles))

1.3: Create a new balanced data set where the two levels of the target variable will be equally represented; Create a bar plot of the newly created target variable. Why is this step necessary?

loan_sample_balanced <- ovun.sample(Status ~ ., data=loan_sample_truncated, method = "under")
loan_sample_under <- data.frame(loan_sample_balanced[["data"]])

# Plot the balance
ggplot(loan_sample_under, aes(x = Status, fill = Status, group = Status)) +
  geom_bar() +
  ylab("Count") +
  xlab("Status of the loan") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  ggtitle("New balanced dataset")

Balancing the dataset is key for training models, especially when dealing with imbalanced classes like predicting loan default. This ensures a fair representation of both outcomes, preventing the model from favoring the more common class. By striking this balance, we aim to improve the model’s accuracy and fairness in predicting both common and less frequent events, contributing to a more reliable and equitable performance.

1.4: Investigating the distribution of the numeric features per the two levels of our target feature

(i.e. default vs non-default). Discuss the visualizations. Which variables seem to be relevant in predicting the target feature?

# Define columns to exclude (these are column with characters)
exclude_columns <- c(3, 4, 6, 7, 14)

# Loop through columns, excluding the specified ones
for (i in setdiff(1:ncol(loan_sample_under), exclude_columns)) {
  # Only plot numeric columns
  if (is.numeric(loan_sample_under[[i]])) {
    print(ggplot(loan_sample_under, aes_string(y = names(loan_sample_under)[i], color = "Status")) +
            geom_boxplot() +
            ylab(names(loan_sample_under[i])) +
            theme(axis.title.x = element_blank(),
                  axis.text.x = element_blank(),
                  axis.ticks.x = element_blank()))
  }
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

While looking at the boxplots, it’s clear that some variables don’t show much difference between the ‘Status’ ‘0’ and ‘1’ categories. Notably, interest rate, annual income, total revenue, and DTI stand out with significant differences between the two statuses. However, manually creating each boxplot can be time-consuming, and interpreting them might lack a systematic approach.

To tackle this, we decided on a more comprehensive method by combining manual boxplot analysis with insights from the Boruta algorithm. This hybrid approach blends both manual examination and algorithmic assistance, aiming for a nuanced understanding of the data and highlighting potential correlations. By doing this, we aim for a well-rounded outcome, drawing on the strengths of both manual and automated analyses to gain a more robust understanding of the dataset’s relationships.

This hybrid model allows for a more informed interpretation, ensuring that the observed deviations in specific variables are thoroughly considered within the broader dataset context. It represents a holistic and balanced approach, bringing together human expertise and algorithmic rigor to deepen our understanding of the correlation patterns in our data.

1.5: Boruta algorithm

In the next step, we run the Boruta algorithm.The Boruta algorithm provides a more objective method for determining the importance of features as it is based on repeated, random comparisons rather than human judgment.

##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
##  11. run of importance source...
## After 11 iterations, +21 secs:
##  confirmed 12 attributes: annual_inc, dti, grade, home_ownership, int_rate and 7 more;
##  rejected 1 attribute: application_type;
##  still have 3 attributes left.
##  12. run of importance source...
##  13. run of importance source...
##  14. run of importance source...
##  15. run of importance source...
## After 15 iterations, +29 secs:
##  confirmed 2 attributes: open_acc, verification_status;
##  still have 1 attribute left.
##  16. run of importance source...
##  17. run of importance source...
##  18. run of importance source...
##  19. run of importance source...
##  20. run of importance source...
##  21. run of importance source...
##  22. run of importance source...
##  23. run of importance source...
##  24. run of importance source...
##  25. run of importance source...
##  26. run of importance source...
##  27. run of importance source...
##  28. run of importance source...
##  29. run of importance source...
##  30. run of importance source...
##  31. run of importance source...
##  32. run of importance source...
##  33. run of importance source...
##  34. run of importance source...
##  35. run of importance source...
##  36. run of importance source...
##  37. run of importance source...
##  38. run of importance source...
##  39. run of importance source...
##  40. run of importance source...
##  41. run of importance source...
##  42. run of importance source...
##  43. run of importance source...
##  44. run of importance source...
##  45. run of importance source...
##  46. run of importance source...
##  47. run of importance source...
##  48. run of importance source...
##  49. run of importance source...
##  50. run of importance source...
##  51. run of importance source...
##  52. run of importance source...
##  53. run of importance source...
##  54. run of importance source...
##  55. run of importance source...
##  56. run of importance source...
##  57. run of importance source...
##  58. run of importance source...
##  59. run of importance source...
##  60. run of importance source...
##  61. run of importance source...
##  62. run of importance source...
##  63. run of importance source...
##  64. run of importance source...
##  65. run of importance source...
##  66. run of importance source...
##  67. run of importance source...
##  68. run of importance source...
##  69. run of importance source...
##  70. run of importance source...
##  71. run of importance source...
##  72. run of importance source...
##  73. run of importance source...
##  74. run of importance source...
##  75. run of importance source...
##  76. run of importance source...
##  77. run of importance source...
##  78. run of importance source...
##  79. run of importance source...
##  80. run of importance source...
##  81. run of importance source...
##  82. run of importance source...
##  83. run of importance source...
##  84. run of importance source...
##  85. run of importance source...
##  86. run of importance source...
##  87. run of importance source...
##  88. run of importance source...
##  89. run of importance source...
##  90. run of importance source...
##  91. run of importance source...
##  92. run of importance source...
##  93. run of importance source...
##  94. run of importance source...
##  95. run of importance source...
##  96. run of importance source...
##  97. run of importance source...
##  98. run of importance source...
##  99. run of importance source...
##  [1] "loan_amnt"           "int_rate"            "grade"              
##  [4] "home_ownership"      "annual_inc"          "verification_status"
##  [7] "purpose"             "dti"                 "open_acc"           
## [10] "revol_bal"           "revol_util"          "total_acc"          
## [13] "total_rec_int"       "tot_cur_bal"         "total_rev_hi_lim"

With the Boruta method providing valuable insights, the next step in our analysis involves a closer examination of multicollinearity. Multicollinearity arises when one predictor variable in a multiple regression model can be accurately predicted from the others. This phenomenon can lead to skewed or misleading coefficient estimates and draw potentially erroneous conclusions.

In essence, the presence of multicollinearity introduces a level of redundancy among predictor variables, making it challenging to isolate their individual effects on the response variable. When high correlations exist between predictors, it becomes difficult for the regression model to distinguish the unique contributions of each variable, potentially compromising the accuracy and reliability of the model.

Identifying and addressing multicollinearity is critical for refining the integrity of our regression analysis. It allows us to disentangle the interdependencies between predictor variables and ensures that each variable’s impact on the response variable is accurately reflected. By navigating the complexities of multicollinearity, we aim to enhance the robustness of our regression model, providing more accurate and trustworthy insights into the relationships between predictors and the target variable.

1.6: Visualize the correlations that emerge between the numerical features. Discuss the results. Which variables are highly correlated? Decide whether you keep all variables.

correlations = cor(loan_sample_under[-c(3,4,6,7,14,17)])
corrplot(correlations) 

In addition to exploring correlations, we delve deeper into their statistical significance. Utilizing the cor_pmat function, we gain the ability to explicitly assess the significance of each correlation, with a distinct focus on highlighting the insignificant ones.

This method goes beyond merely identifying correlations; it allows us to scrutinize the strength and reliability of each correlation coefficient. By explicitly marking correlations that are deemed statistically insignificant, we enhance our ability to distinguish between robust relationships and those that may occur by chance.

The cor_pmat function becomes a valuable tool in this context, offering a visual representation that guides us in pinpointing correlations that warrant closer scrutiny. This nuanced approach not only emphasizes the presence of correlations but also refines our understanding by discerning the statistical significance associated with each, providing a more comprehensive evaluation of the interplay between variables.

p_value_mat <- cor_pmat(loan_sample_under[,-c(3,4,6,7,14,17)])
ggcorrplot(correlations, type = "lower", p.mat = p_value_mat) 

When we look at how variables in our dataset relate to each other, those with a high correlation, close to 1 or -1, are seen as strongly linked. It suggests that knowing one variable might pretty much tell us about the other, hinting at a kind of redundant or closely connected relationship.

Conversely, when the correlation is closer to 0, it means the relationship is weaker. It could be moderate or just a faint connection between variables. This indicates that understanding one variable might not necessarily give us a good prediction for another.

In our analysis, we checked correlation plots and got insights from the Boruta algorithm. Interestingly, despite seeing varying strengths in correlations, we’ve decided to keep all the variables. The reason behind this choice is that even if the relationships are not super strong, each variable brings in unique and valuable information.

The idea here is to appreciate the mix of both strong and not-so-strong connections between variables. Keeping all the variables allows us to have a more complete understanding of the dataset, considering the nuances and diverse information each variable brings to the table.

1.7: Use a bar plot visualization to investigate the associations between the categorical variables and the target feature.

# Generate a bar plot for each categorical variable
categorical_vars <- c("home_ownership", "verification_status", "purpose", "addr_state", "application_type")
# Loop over categorical variables and plot using aes() and without aes_string()
for (cat_var in categorical_vars) {
  # Check if the column exists to avoid errors
  if (!cat_var %in% names(loan_sample_under)) {
    message(paste("Skipping", cat_var, "as it is not found in the dataset."))
    next
  }
  
  # Create the plot
  plot <- loan_sample_under %>%
    group_by(.data[[cat_var]], Status) %>%
    summarise(Count = n(), .groups = 'drop') %>%
    ggplot(aes(x = .data[[cat_var]], y = Count, fill = as.factor(Status))) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = paste("Bar plot of", cat_var, "by Status"), x = cat_var, y = "Count") +
    scale_fill_discrete(name = "Status") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels for readability
   # Print the plot
  print(plot)
}

## Skipping addr_state as it is not found in the dataset.

Hier Text schreiben!

1.7: Plot an interactive scatter plot of the association between the loan amount requested and the annual income of the borrower. Discuss the plot. What can you tell about the association?

library(plotly)

# Assuming your data is in a dataframe called loan_sample and
# the relevant columns are named loan_amnt and annual_inc
plot_ly(data = loan_sample_under, x = ~loan_amnt, y = ~annual_inc, 
        type = 'scatter', mode = 'markers', 
        marker = list(size = 10, opacity = 0.5)) %>%
  layout(title = 'Association between Loan Amount and Annual Income',
         xaxis = list(title = 'Loan Amount Requested'),
         yaxis = list(title = 'Annual Income of Borrower'))

As we examine the density plot, the regions with high density unveil common combinations of income and loan amounts, offering insights into potential standard loan products or characteristic borrower profiles. The clustering of data points in these areas indicates prevalent patterns that could be indicative of well-established loan products catering to specific income brackets or borrower demographics.

Notably, the observation of a wide range of incomes accompanied by a relatively narrow range of loan amounts prompts considerations about the sensitivity of loan amounts to income variations. This pattern may imply that, beyond a certain income threshold, the loan amount becomes less sensitive to further income increases. In other words, there seems to be a saturation point where increases in income may not significantly impact the loan amount, suggesting a potential cap or standardization in the lending process.

Understanding these density patterns not only sheds light on the typical borrower scenarios but also hints at the underlying dynamics between income and loan amounts. It invites further exploration into the factors influencing loan decisions and provides valuable insights for refining lending strategies or tailoring financial products to meet the diverse needs of different income segments.

Excercise 2 Train and test a logistic classifier.

Hier Text schreiben !!! ### 2.1: Dividing the sample into training and testing set

# Set seed for reproducibility
set.seed(7)

# Split the data into training (70%) and testing (30%) sets
splitIndex <- createDataPartition(loan_sample_under$Status, p = 0.7, list = FALSE)
training_set <- loan_sample_under[splitIndex,]
testing_set <- loan_sample_under[-splitIndex,]

Hier Text schreiben!

PercTable(loan_sample_under$Status)
##                
##     freq   perc
##                
## 0  5'191  50.0%
## 1  5'190  50.0%
PercTable(training_set$Status)
##                
##     freq   perc
##                
## 0  3'634  50.0%
## 1  3'633  50.0%
PercTable(testing_set$Status)
##                
##     freq   perc
##                
## 0  1'557  50.0%
## 1  1'557  50.0%

2.2: Train the classifier and report the coefficients obtained and interpret the results.

In the next step, we train the logit model. In terms of our inputs i.e. our Xs, we use all variables included in the data_new_under apart from the status, which is our Y. How would you interpret the results printed from the summary fit1?

## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.502e+00  3.188e-01  -7.846 4.29e-15 ***
## loan_amnt                           6.021e-05  6.042e-06   9.965  < 2e-16 ***
## int_rate                            1.124e-01  1.531e-02   7.341 2.12e-13 ***
## gradeB                              3.985e-01  1.049e-01   3.798 0.000146 ***
## gradeC                              5.648e-01  1.367e-01   4.133 3.58e-05 ***
## gradeD                              6.147e-01  2.014e-01   3.053 0.002268 ** 
## home_ownershipOWN                   6.345e-03  9.334e-02   0.068 0.945806    
## home_ownershipRENT                  2.215e-01  6.796e-02   3.259 0.001119 ** 
## annual_inc                         -5.566e-06  1.183e-06  -4.706 2.52e-06 ***
## verification_statusSource Verified  1.267e-01  5.950e-02   2.130 0.033164 *  
## verification_statusVerified         7.239e-02  6.820e-02   1.061 0.288522    
## purposecredit_card                 -5.311e-02  2.425e-01  -0.219 0.826621    
## purposedebt_consolidation          -3.014e-02  2.385e-01  -0.126 0.899433    
## purposehome_improvement             1.459e-01  2.608e-01   0.560 0.575775    
## purposehouse                       -7.137e-01  4.187e-01  -1.705 0.088253 .  
## purposemajor_purchase              -1.038e-02  2.969e-01  -0.035 0.972113    
## purposemedical                      9.694e-02  3.387e-01   0.286 0.774748    
## purposemoving                      -6.495e-01  3.769e-01  -1.723 0.084845 .  
## purposeother                       -1.591e-01  2.552e-01  -0.623 0.533091    
## purposerenewable_energy             9.113e-01  9.181e-01   0.993 0.320917    
## purposesmall_business               3.667e-01  3.500e-01   1.048 0.294720    
## purposevacation                     8.011e-02  3.429e-01   0.234 0.815283    
## purposewedding                     -1.836e+00  8.557e-01  -2.145 0.031959 *  
## dti                                 1.412e-02  3.644e-03   3.875 0.000107 ***
## open_acc                            4.295e-02  9.351e-03   4.593 4.36e-06 ***
## revol_bal                          -1.064e-05  7.926e-06  -1.342 0.179469    
## revol_util                          1.363e-03  1.988e-03   0.686 0.492961    
## total_acc                          -8.441e-03  3.917e-03  -2.155 0.031175 *  
## total_rec_int                      -2.410e-04  2.285e-05 -10.545  < 2e-16 ***
## application_typeJoint App           1.403e-01  2.058e-01   0.682 0.495248    
## tot_cur_bal                        -7.800e-07  3.778e-07  -2.065 0.038950 *  
## total_rev_hi_lim                    1.483e-06  4.517e-06   0.328 0.742687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10074.2  on 7266  degrees of freedom
## Residual deviance:  9186.6  on 7235  degrees of freedom
## AIC: 9250.6
## 
## Number of Fisher Scoring iterations: 4

We can print out only the significant variables with p-value lower than 0.05. We notice that 9 variables are found statistically significant.

##  [1] "loan_amnt"                          "int_rate"                          
##  [3] "gradeB"                             "gradeC"                            
##  [5] "gradeD"                             "home_ownershipRENT"                
##  [7] "annual_inc"                         "verification_statusSource Verified"
##  [9] "purposewedding"                     "dti"                               
## [11] "open_acc"                           "total_acc"                         
## [13] "total_rec_int"                      "tot_cur_bal"

2.3: Plot the ROC and the Precision/Recall Curve and interpret the results.

Next, we aim to evaluate the predictive performance of our model. To do so, we will plot the ROC curve.

testing_set$fit1_score <- predict(fit1,type='response',testing_set)
fit1_pred <- prediction(testing_set$fit1_score, testing_set$Status)
fit1_roc <- performance(fit1_pred, "tpr", "fpr")
plot(fit1_roc, lwd=1, colorize = TRUE, main = "Fit1: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

In the subsequent step of our analysis, we delve into the intricacies of our model by visualizing the Precision/Recall Curve. This curve serves as a comprehensive illustration of the delicate balance between the true positive rate and the positive predictive value across various probability thresholds.

As we navigate through different probability thresholds, the Precision/Recall Curve provides a nuanced perspective on the model’s performance. It vividly captures the trade-offs inherent in our predictive model, offering valuable insights into how adjusting the probability threshold impacts both precision and recall.

Precision, representing the accuracy of positive predictions, and recall, measuring the model’s ability to capture all actual positive instances, are pivotal metrics in evaluating classifier performance. By exploring this curve, we gain a deeper understanding of the model’s behavior across a spectrum of probability thresholds, enabling us to make informed decisions about the optimal threshold for our specific use case.

This visual representation not only aids in model evaluation but also guides us in fine-tuning our predictive approach for achieving the desired balance between precision and recall, aligning with the specific objectives and requirements of our analysis.

fit1_precision <- performance(fit1_pred, measure = "prec", x.measure = "rec")
plot(fit1_precision, main="Fit1: Logit - Precision vs Recall")

Hier noch genau erklären und herausfinden warum es ein Sprung macht. Erklärung von Chat GPT: Die Kurve kann Unregelmäßigkeiten in der Verteilung der Klassifizierungswahrscheinlichkeiten widerspiegeln. Ein kleiner Satz von Beispielen, die sehr sicher als positiv klassifiziert werden, gefolgt von einer größeren Gruppe von Beispielen mit weniger Sicherheit, könnte zu einem solchen Sprung führen.

2.4: Produce the confusion matrix and interpret the results.

# Extract the confusion matrix 
cm <- confusionMatrix(as.factor(round(testing_set$fit1_score)), testing_set$Status)
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 997 561
##          1 560 996
##                                           
##                Accuracy : 0.64            
##                  95% CI : (0.6229, 0.6569)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.28            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6403          
##             Specificity : 0.6397          
##          Pos Pred Value : 0.6399          
##          Neg Pred Value : 0.6401          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3202          
##    Detection Prevalence : 0.5003          
##       Balanced Accuracy : 0.6400          
##                                           
##        'Positive' Class : 0               
## 

2.5 Report the AUC values and the overall accuracy and interpret the results.

# AUC
fit1_auc <- performance(fit1_pred, measure = "auc")

# Overall accuracy
accuracy <- sum(diag(cm$table)) / sum(cm$table)

# Print of the values
cat("AUC: ", fit1_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy)
## AUC:  69.61645 
## Overall Accuracy:  0.6400128

Excercise 3

Thinking about the pre-processing steps that you carried out before training the logistic classifier: Can you think of a way to improve the predictive performance of your data?

3.1.1: Training the logistic classifier without any preprocessing and feature selection stepps

# Set seed for reproducibility
set.seed(7)

# Splitting the data into testing and training data
splitIndex2 <- createDataPartition(loan_sample$Status, p = 0.7, list = FALSE)
training_set2 <- loan_sample[splitIndex2,]
testing_set2 <- loan_sample[-splitIndex2,]

# Percentage table of Status with out preprocessing
PercTable(loan_sample$Status)
##                
##     freq   perc
##                
## 0 34'810  87.0%
## 1  5'190  13.0%
PercTable(training_set2$Status)
##                
##     freq   perc
##                
## 0 24'367  87.0%
## 1  3'633  13.0%
PercTable(testing_set2$Status)
##                
##     freq   perc
##                
## 0 10'443  87.0%
## 1  1'557  13.0%
# Fitting the model
fit2 <- glm(Status ~ ., data=training_set2,family=binomial())
summary(fit2)
## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set2)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -4.312e+00  2.390e-01 -18.041  < 2e-16 ***
## loan_amnt                           6.705e-05  4.327e-06  15.496  < 2e-16 ***
## int_rate                            1.032e-01  9.869e-03  10.452  < 2e-16 ***
## gradeB                              3.546e-01  8.463e-02   4.190 2.79e-05 ***
## gradeC                              6.056e-01  1.012e-01   5.984 2.18e-09 ***
## gradeD                              6.676e-01  1.414e-01   4.721 2.35e-06 ***
## home_ownershipOWN                   6.147e-02  6.938e-02   0.886 0.375618    
## home_ownershipRENT                  2.356e-01  5.079e-02   4.638 3.51e-06 ***
## annual_inc                         -5.235e-06  8.955e-07  -5.846 5.04e-09 ***
## verification_statusSource Verified  7.956e-02  4.464e-02   1.782 0.074712 .  
## verification_statusVerified         7.838e-02  5.021e-02   1.561 0.118519    
## purposecredit_card                 -6.694e-02  1.891e-01  -0.354 0.723338    
## purposedebt_consolidation          -6.613e-02  1.859e-01  -0.356 0.722028    
## purposehome_improvement             1.180e-01  2.022e-01   0.584 0.559546    
## purposehouse                       -2.357e-01  3.278e-01  -0.719 0.472001    
## purposemajor_purchase               5.665e-03  2.253e-01   0.025 0.979936    
## purposemedical                      2.148e-01  2.394e-01   0.897 0.369624    
## purposemoving                      -7.206e-01  3.088e-01  -2.334 0.019604 *  
## purposeother                        1.113e-02  1.971e-01   0.056 0.954971    
## purposerenewable_energy             5.033e-01  5.596e-01   0.899 0.368417    
## purposesmall_business               2.876e-01  2.463e-01   1.168 0.242874    
## purposevacation                     3.334e-01  2.538e-01   1.313 0.189048    
## purposewedding                     -8.367e-02  5.370e-01  -0.156 0.876169    
## dti                                 1.374e-02  2.605e-03   5.273 1.34e-07 ***
## open_acc                            3.992e-02  6.667e-03   5.989 2.11e-09 ***
## revol_bal                          -6.500e-08  5.694e-06  -0.011 0.990891    
## revol_util                          1.746e-03  1.437e-03   1.215 0.224529    
## total_acc                          -1.007e-02  2.862e-03  -3.519 0.000433 ***
## total_rec_int                      -2.345e-04  1.591e-05 -14.742  < 2e-16 ***
## application_typeJoint App           8.834e-02  1.401e-01   0.631 0.528206    
## tot_cur_bal                        -5.014e-07  2.871e-07  -1.747 0.080667 .  
## total_rev_hi_lim                   -6.259e-06  3.384e-06  -1.850 0.064351 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21611  on 27999  degrees of freedom
## Residual deviance: 19941  on 27968  degrees of freedom
## AIC: 20005
## 
## Number of Fisher Scoring iterations: 5
# Print out only the significant variables with p-value lower than 0.05.
significant.variables2 <- summary(fit2)$coeff[-1,4] < 0.05
names(significant.variables2)[significant.variables2 == TRUE]
##  [1] "loan_amnt"          "int_rate"           "gradeB"            
##  [4] "gradeC"             "gradeD"             "home_ownershipRENT"
##  [7] "annual_inc"         "purposemoving"      "dti"               
## [10] "open_acc"           "total_acc"          "total_rec_int"
testing_set2$fit2_score <- predict(fit2,type='response',testing_set2)
fit2_pred <- prediction(testing_set2$fit2_score, testing_set2$Status)
fit2_roc <- performance(fit2_pred, "tpr", "fpr")
plot(fit2_roc, lwd=1, colorize = TRUE, main = "Fit2: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

fit2_precision <- performance(fit2_pred, measure = "prec", x.measure = "rec")
plot(fit2_precision, main="Fit2: Logit - Precision vs Recall")

# Extract the confusion matrix 
cm2 <- confusionMatrix(as.factor(round(testing_set2$fit2_score)), testing_set2$Status)
print(cm2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10414  1534
##          1    29    23
##                                           
##                Accuracy : 0.8698          
##                  95% CI : (0.8636, 0.8757)
##     No Information Rate : 0.8702          
##     P-Value [Acc > NIR] : 0.5714          
##                                           
##                   Kappa : 0.0204          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99722         
##             Specificity : 0.01477         
##          Pos Pred Value : 0.87161         
##          Neg Pred Value : 0.44231         
##              Prevalence : 0.87025         
##          Detection Rate : 0.86783         
##    Detection Prevalence : 0.99567         
##       Balanced Accuracy : 0.50600         
##                                           
##        'Positive' Class : 0               
## 
# AUC
fit2_auc <- performance(fit2_pred, measure = "auc")

# Overall accuracy
accuracy2 <- sum(diag(cm2$table)) / sum(cm2$table)

# Print of the values
cat("AUC: ", fit2_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy2)
## AUC:  69.35202 
## Overall Accuracy:  0.86975

3.1.2: Training the logistic classifier without the collumn application type

# Set seed for reproducibility
set.seed(7)

# Deleting column application_type
loan_sample_under_new <- loan_sample_under[-14]

# Splitting the data into testing and training data
splitIndex3 <- createDataPartition(loan_sample_under_new$Status, p = 0.7, list = FALSE)
training_set3 <- loan_sample_under_new[splitIndex3,]
testing_set3 <- loan_sample_under_new[-splitIndex3,]

# Percentage table of Status with out preprocessing
PercTable(loan_sample_under_new$Status)
##                
##     freq   perc
##                
## 0  5'191  50.0%
## 1  5'190  50.0%
PercTable(training_set3$Status)
##                
##     freq   perc
##                
## 0  3'634  50.0%
## 1  3'633  50.0%
PercTable(testing_set3$Status)
##                
##     freq   perc
##                
## 0  1'557  50.0%
## 1  1'557  50.0%
# Fitting the model
fit3 <- glm(Status ~ ., data=training_set3,family=binomial())
summary(fit3)
## 
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set3)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -2.503e+00  3.188e-01  -7.852 4.09e-15 ***
## loan_amnt                           6.064e-05  6.010e-06  10.090  < 2e-16 ***
## int_rate                            1.127e-01  1.530e-02   7.364 1.78e-13 ***
## gradeB                              3.981e-01  1.049e-01   3.794 0.000148 ***
## gradeC                              5.639e-01  1.367e-01   4.127 3.68e-05 ***
## gradeD                              6.123e-01  2.014e-01   3.041 0.002358 ** 
## home_ownershipOWN                   5.751e-03  9.334e-02   0.062 0.950871    
## home_ownershipRENT                  2.216e-01  6.796e-02   3.260 0.001114 ** 
## annual_inc                         -5.612e-06  1.181e-06  -4.752 2.01e-06 ***
## verification_statusSource Verified  1.261e-01  5.949e-02   2.120 0.034004 *  
## verification_statusVerified         7.273e-02  6.820e-02   1.067 0.286189    
## purposecredit_card                 -5.463e-02  2.424e-01  -0.225 0.821709    
## purposedebt_consolidation          -3.023e-02  2.385e-01  -0.127 0.899133    
## purposehome_improvement             1.451e-01  2.608e-01   0.557 0.577858    
## purposehouse                       -7.130e-01  4.189e-01  -1.702 0.088708 .  
## purposemajor_purchase              -1.169e-02  2.968e-01  -0.039 0.968574    
## purposemedical                      1.040e-01  3.384e-01   0.307 0.758693    
## purposemoving                      -6.480e-01  3.768e-01  -1.720 0.085432 .  
## purposeother                       -1.589e-01  2.552e-01  -0.623 0.533444    
## purposerenewable_energy             9.099e-01  9.179e-01   0.991 0.321555    
## purposesmall_business               3.653e-01  3.499e-01   1.044 0.296470    
## purposevacation                     7.889e-02  3.429e-01   0.230 0.818059    
## purposewedding                     -1.838e+00  8.560e-01  -2.147 0.031786 *  
## dti                                 1.429e-02  3.635e-03   3.931 8.45e-05 ***
## open_acc                            4.268e-02  9.342e-03   4.568 4.91e-06 ***
## revol_bal                          -1.072e-05  7.925e-06  -1.353 0.176182    
## revol_util                          1.342e-03  1.988e-03   0.675 0.499482    
## total_acc                          -8.411e-03  3.917e-03  -2.147 0.031764 *  
## total_rec_int                      -2.420e-04  2.280e-05 -10.615  < 2e-16 ***
## tot_cur_bal                        -7.642e-07  3.770e-07  -2.027 0.042670 *  
## total_rev_hi_lim                    1.499e-06  4.516e-06   0.332 0.739938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10074.2  on 7266  degrees of freedom
## Residual deviance:  9187.1  on 7236  degrees of freedom
## AIC: 9249.1
## 
## Number of Fisher Scoring iterations: 4
# Print out only the significant variables with p-value lower than 0.05.
significant.variables3 <- summary(fit3)$coeff[-1,4] < 0.05
names(significant.variables3)[significant.variables3 == TRUE]
##  [1] "loan_amnt"                          "int_rate"                          
##  [3] "gradeB"                             "gradeC"                            
##  [5] "gradeD"                             "home_ownershipRENT"                
##  [7] "annual_inc"                         "verification_statusSource Verified"
##  [9] "purposewedding"                     "dti"                               
## [11] "open_acc"                           "total_acc"                         
## [13] "total_rec_int"                      "tot_cur_bal"
testing_set3$fit3_score <- predict(fit3,type='response',testing_set3)
fit3_pred <- prediction(testing_set3$fit3_score, testing_set3$Status)
fit3_roc <- performance(fit3_pred, "tpr", "fpr")
plot(fit3_roc, lwd=1, colorize = TRUE, main = "Fit3: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)

fit3_precision <- performance(fit3_pred, measure = "prec", x.measure = "rec")
plot(fit3_precision, main="Fit3: Logit - Precision vs Recall")

# Extract the confusion matrix 
cm3 <- confusionMatrix(as.factor(round(testing_set3$fit3_score)), testing_set3$Status)
print(cm3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1001  560
##          1  556  997
##                                           
##                Accuracy : 0.6416          
##                  95% CI : (0.6245, 0.6585)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2832          
##                                           
##  Mcnemar's Test P-Value : 0.9284          
##                                           
##             Sensitivity : 0.6429          
##             Specificity : 0.6403          
##          Pos Pred Value : 0.6413          
##          Neg Pred Value : 0.6420          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3215          
##    Detection Prevalence : 0.5013          
##       Balanced Accuracy : 0.6416          
##                                           
##        'Positive' Class : 0               
## 
# AUC
fit3_auc <- performance(fit3_pred, measure = "auc")

# Overall accuracy
accuracy3 <- sum(diag(cm3$table)) / sum(cm3$table)

# Print of the values
cat("AUC: ", fit3_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy3)
## AUC:  69.62428 
## Overall Accuracy:  0.6416185

3.2: What can you do differently? (hint: Feel free to be creative and discuss any additional step in data collection and/or data pre-processing that you might try so to improve the results)

Excercise 4

Finally, thinking about putting your model into action and basing credit decisions on the prediction that it generates: • What kind of challenges may a company face if it would use your model in their daily business, in particular in regard to ethical challenges and moral obligations companies have? Please refer to the „common ethical issues in the context the creation of value from data” (see slides week 11) in your answer. • Can you think of a way how companies can overcome or at least mitigate the issues that you described above?